The Evolution of Large Language Model Architectures: From BERT to GPT and T5
AI012 Lesson 2

The Three Paradigms of the Transformer Architecture

The evolution of large language models marks a paradigm shift: away from task-specific models and toward "unified pre-training," in which a single architecture can be adapted to a wide range of natural language processing needs.

At the heart of this shift is the self-attention mechanism, which lets a model weigh the importance of different words in a sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
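The formula above can be sketched in a few lines of NumPy — a minimal illustration for intuition, not an optimized implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # weighted sum of value vectors

# Toy example: a sequence of 3 tokens with d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # one output vector per query token
```

The `1/sqrt(d_k)` scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.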

1. Encoder-Only (BERT)

  • Mechanism: Masked Language Modeling (MLM)
  • Behavior: bidirectional context; the model "sees" the entire sentence at once in order to predict the masked words.
  • Best for: natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).

2. Decoder-Only (GPT)

  • Mechanism: autoregressive modeling
  • Behavior: left-to-right processing; predicts the next token strictly from the preceding context (causal masking).
  • Best for: natural language generation (NLG) and creative writing. This is the foundation of modern LLMs such as GPT-4 and Llama 3.

3. Encoder-Decoder (T5)

  • Mechanism: Text-to-Text Transfer Transformer.
  • Behavior: the encoder converts the input string into a dense representation, and the decoder generates the target string.
  • Best for: translation, summarization, and similar sequence-to-sequence tasks.
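The behavioral difference between the encoder's bidirectional context and the decoder's left-to-right processing comes down to the attention mask. A minimal sketch:

```python
import numpy as np

seq_len = 4

# Encoder-style (BERT): every token may attend to every position
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (GPT): token i may attend only to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# The mask is applied before the softmax: disallowed positions get -inf,
# so they receive zero attention weight
scores = np.zeros((seq_len, seq_len))   # placeholder attention scores
masked = np.where(causal, scores, -np.inf)
```

Because `exp(-inf) = 0`, the softmax assigns no weight to future tokens, which is exactly what makes next-token prediction well defined during training.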

Key Insight: Decoder Dominance

The industry has largely converged on decoder-only architectures, owing to their superior scaling laws and their emergent reasoning abilities in zero-shot settings.

VRAM and the Context Window

In a decoder-only model, the KV cache grows linearly with sequence length. A 100K-token context window requires far more VRAM than an 8K window, which makes local deployment of long-context models very challenging without quantization.
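A back-of-the-envelope estimate makes the linear growth concrete. The configuration below (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical Llama-style setup chosen for illustration, not the specification of any particular model:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: keys + values stored per layer, per head, per token.
    Defaults are a hypothetical fp16, Llama-style configuration."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 1024 ** 3
short = kv_cache_bytes(8_192) / GIB     # 8K window  -> 1.0 GiB
long = kv_cache_bytes(100_000) / GIB    # 100K window -> ~12.2 GiB
```

Under these assumptions the cache costs a fixed 128 KiB per token, so a 100K window needs roughly 12x the VRAM of an 8K window — before counting the model weights themselves.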
Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

  • Decoders scale more effectively for generative tasks and follow instructions via next-token prediction.
  • Encoders cannot process text bidirectionally.
  • Decoders require less training data for classification tasks.
  • Encoders are incompatible with the self-attention mechanism.
Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

  • Encoder-Only (BERT)
  • Decoder-Only (GPT)
  • Encoder-Decoder (T5)
  • Recurrent Neural Networks (RNN)
Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a decoder-only model might be preferred over an encoder-decoder model in modern deployments.
Step 1
Identify the architectural bottleneck regarding context processing.
Solution:
Encoder-decoder models must push the entire long input through the encoder and then perform cross-attention in the decoder, which is computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly as a single token stream. With modern techniques such as FlashAttention and KV cache optimization, scaling the context window of a decoder-only model is more streamlined and efficient for real-time generation.
Step 2
Justify the preference using Scaling Laws.
Solution:
Decoder-only models have demonstrated highly predictable performance improvements (scaling laws) as parameters and training data increase. This massive scale unlocks "emergent abilities," allowing a single decoder-only model to perform zero-shot summarization effectively, without the task-specific fine-tuning often required by smaller encoder-decoder setups.